DATA SCIENCE INTENSIVE :: Intro to ML in Python

An Intensive Python ML Course

Week 01: Intro to Estimation Theory

← Back to course webpage

Feedback should be sent to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE INTENSIVE SERIES :: Introduction to ML in Python DataKolektiv course.

Goran S. Milovanović, PhD

DataKolektiv, Chief Scientist & Owner

Aleksandar Cvetković, PhD

DataKolektiv, Consultant

Intro to Estimation Theory

Covariance and Correlation

It is not difficult to obtain the correlation coefficient from two vectors in numpy:

Numpy will always return a full correlation matrix, so:
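The accompanying code cell is not reproduced here; a minimal sketch, assuming two hypothetical vectors `v1` and `v2`, might look like this:

```python
import numpy as np

# Two hypothetical example vectors
v1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v2 = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# np.corrcoef returns the full 2x2 correlation matrix
corr_matrix = np.corrcoef(v1, v2)
print(corr_matrix)

# The correlation coefficient itself is an off-diagonal element
r = corr_matrix[0, 1]
print(r)
```

The diagonal entries are each variable's correlation with itself, which is always 1; the coefficient we care about sits off the diagonal.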

However, we now want to understand what the correlation coefficient really is. Intuitively, we say that it describes a relationship between two variables. The correlation coefficient - more precisely, Pearson's correlation coefficient in our case - can vary from -1 to +1, describing a negative or positive, strong or weak linear relationship between two variables.

You might have stumbled upon various formulas that compute this correlation coefficient. But it is easy to understand what it really is once we introduce a more elementary concept: the covariance between two random variables.

Covariance. Given two random variables (RVs), $X$ and $Y$, their (sample) covariance is given by:

$$cov(X,Y) = E[(X-E[X])(Y-E[Y])] = \frac{\sum_{i=1}^{N}(x_i-\bar{X})(y_i-\bar{Y})}{N-1}$$

Of course, we have np.cov() for covariance:

The covariance of a variable with itself is its variance:
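The original code cell is not shown; a minimal sketch, reusing the hypothetical vectors `v1` and `v2` from above, might look like this:

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v2 = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# np.cov, like np.corrcoef, returns a full matrix;
# the off-diagonal entries are cov(v1, v2)
cov_matrix = np.cov(v1, v2)
print(cov_matrix)

# The diagonal holds each variable's covariance with itself,
# i.e. its (sample) variance - these two agree (both 2.5):
print(cov_matrix[0, 0], np.var(v1, ddof=1))
```

Note the `ddof=1` in `np.var`: it requests the sample variance (division by N-1), matching the default behavior of `np.cov`.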

Enter the z-score: the standardization of random variables.

Pearson's coefficient of correlation is nothing other than the covariance between $X$ and $Y$ upon their standardization. The standardization of an RV - widely known as the variable's z-score - is obtained by subtracting the mean from each of its values and dividing by the standard deviation; for the i-th observation of $X$:

$$z(x_i) = \frac{x_i-\bar{X}}{\sigma}$$

Now Pearson's correlation between v1 and v2:

So: Pearson's correlation coefficient is just the covariance between standardized variables.
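The code cell demonstrating this is not reproduced here; a sketch, again assuming the hypothetical vectors `v1` and `v2`, might look like this:

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v2 = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

def z_score(x):
    # Standardize: subtract the mean, divide by the sample standard deviation
    return (x - x.mean()) / x.std(ddof=1)

# Covariance of the standardized variables...
r_from_cov = np.cov(z_score(v1), z_score(v2))[0, 1]
# ...equals Pearson's correlation coefficient
r_direct = np.corrcoef(v1, v2)[0, 1]
print(r_from_cov, r_direct)
```

The two numbers agree to numerical precision, which is exactly the claim above.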

Finally, to determine how much variance is shared between two variables, we square the correlation coefficient to obtain the coefficient of determination, $R^2$.

We have already introduced some of the building blocks for the first model of statistical learning that we will discuss in the next step: Simple Linear Regression.

In Simple Linear Regression, we discuss the model of the following functional form:

$$Y = \beta_0 + \beta_1X_1 + \epsilon $$

If we assume that the relationship between $X$ and $Y$ is indeed linear - and introduce some additional assumptions that we will discuss in our next session - the following question remains:

What values of $\beta_0$ and $\beta_1$ would pick the line in the plane spanned by the $X$ and $Y$ values that best describes the assumed linear relationship between them?

Linear Regression

The parameters, $\beta_0$ (intercept) and $\beta_1$ (slope)

The residuals are what is represented by the model error term $\epsilon$: they represent the difference between the observed and the predicted value:
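The notebook's code cells are not reproduced here; the sketch below uses synthetic data (the true parameters $\beta_0 = 2$, $\beta_1 = 1.5$ and the noise scale are assumptions) and NumPy's `np.polyfit` as a stand-in for the statsmodels fit used in the course:

```python
import numpy as np

# Synthetic data with known true parameters (hypothetical)
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=100)

# Fit slope (beta_1) and intercept (beta_0) by ordinary least squares;
# np.polyfit returns coefficients from highest degree down
beta1, beta0 = np.polyfit(x, y, deg=1)
y_hat = beta0 + beta1 * x      # predicted values
residuals = y - y_hat          # epsilon: observed minus predicted
print(beta0, beta1)
print(residuals.mean())        # with an intercept, residuals sum to ~0
```

That the residuals average out to (numerically) zero is a general property of least-squares fits that include an intercept.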

Pearson's $R$ and $R^2$

OK, statsmodels can do it; but how do we find the optimal values of $\beta_0$ and $\beta_1$ ourselves? Let's build a function that takes some particular values of $\beta_0$ and $\beta_1$ for a particular regression problem (i.e. for a particular dataset) and returns the model error.

The model error? Oh. Remember the residuals:

$$\epsilon_i = y_i - \hat{y_i}$$

where $y_i$ is the observation to be predicted, and $\hat{y_i}$ is the model's prediction?

Next, just as in the computation of variance, we square the differences:

$$\epsilon_i^2 = (y_i - \hat{y_i})^2$$

and define the model error for all observations to be the sum of squares:

$$SSE = \sum_{i=1}^{N}(y_i - \hat{y_i})^2$$

Obviously, the lower the $SSE$ - the Sum of Squared Errors - the better the model! Here's a function that returns the SSE for a given data set (with two columns: the predictor and the criterion) and a choice of parameters $\beta_0$ and $\beta_1$:
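The original definition of `lg_sse()` is not shown here; a sketch of what it might look like, assuming the data set is a two-column NumPy array (predictor first, criterion second), follows:

```python
import numpy as np

def lg_sse(params, data):
    """Return the SSE of the simple linear model beta0 + beta1 * x.

    params: a pair (beta0, beta1);
    data: 2-column array, predictor in column 0, criterion in column 1.
    """
    beta0, beta1 = params
    x, y = data[:, 0], data[:, 1]
    y_hat = beta0 + beta1 * x          # model predictions
    return np.sum((y - y_hat) ** 2)    # sum of squared residuals

# Sanity check on a perfectly linear data set: y = 1 + 2x
data = np.column_stack([np.arange(5.0), 1.0 + 2.0 * np.arange(5.0)])
print(lg_sse((1.0, 2.0), data))   # true parameters -> SSE is 0.0
print(lg_sse((0.0, 2.0), data))   # each residual is 1, so SSE is 5.0
```

With the true parameters the SSE vanishes; any other choice inflates it, which is exactly the signal the search methods below exploit.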

Test lg_sse() now:

Check via statsmodels:

Method A. Random parameter space search
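The course code for this method is not shown; a minimal sketch, with synthetic data (true parameters $\beta_0 = 2$, $\beta_1 = 1.5$ are assumptions) and a search box of $[-5, 5]^2$, might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=200)
data = np.column_stack([x, y])

def lg_sse(params, data):
    beta0, beta1 = params
    return np.sum((data[:, 1] - (beta0 + beta1 * data[:, 0])) ** 2)

# Draw random (beta0, beta1) pairs and keep the pair with the lowest SSE
candidates = rng.uniform(-5, 5, size=(10_000, 2))
sses = np.array([lg_sse(c, data) for c in candidates])
best = candidates[np.argmin(sses)]
print(best, sses.min())
```

With 10,000 draws the best pair typically lands close to the true parameters; more draws buy more precision, at a linear cost in SSE evaluations.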

Check with statsmodels:

Not bad, how about 100,000 random pairs?

Method B. Grid search
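Again, the original cell is not reproduced; a sketch of a grid search over the same synthetic problem (the grid ranges and resolution are arbitrary assumptions) might be:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=200)
data = np.column_stack([x, y])

def lg_sse(params, data):
    beta0, beta1 = params
    return np.sum((data[:, 1] - (beta0 + beta1 * data[:, 0])) ** 2)

# Evaluate the SSE on a regular grid of (beta0, beta1) values
beta0_grid = np.linspace(0, 4, 201)
beta1_grid = np.linspace(0, 3, 201)
sse = np.array([[lg_sse((b0, b1), data) for b1 in beta1_grid]
                for b0 in beta0_grid])

# Locate the smallest SSE on the grid
i, j = np.unravel_index(np.argmin(sse), sse.shape)
print(beta0_grid[i], beta1_grid[j])
```

Unlike random search, the grid guarantees a known resolution (here 0.02 and 0.015 per step), but the number of evaluations grows quadratically with that resolution.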

Check with statsmodels:

Method C. Optimization (the real thing)

The Method of Least Squares
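The notebook's own optimization cell is not shown; one common choice, sketched below on the same synthetic data, is to hand the SSE objective to `scipy.optimize.minimize` and compare against the closed-form least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=200)
data = np.column_stack([x, y])

def lg_sse(params, data):
    beta0, beta1 = params
    return np.sum((data[:, 1] - (beta0 + beta1 * data[:, 0])) ** 2)

# Minimize the SSE objective directly, from an arbitrary starting guess
result = minimize(lg_sse, x0=np.array([0.0, 0.0]), args=(data,))
print(result.x)     # optimal (beta0, beta1)
print(result.fun)   # final value of the objective: the model SSE

# The closed-form least-squares solution agrees with the numerical optimum
beta1_ls, beta0_ls = np.polyfit(x, y, deg=1)
print(beta0_ls, beta1_ls)
```

Because the SSE surface of a linear model is a smooth convex paraboloid, a general-purpose optimizer converges to the same solution that the normal equations give in closed form.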

Check against statsmodels

Final value of the objective function (the model SSE, indeed):

Check against statsmodels

Error Surface Plot: The Objective Function

Maximum Likelihood Estimation
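The original MLE cells are not reproduced; a sketch of the idea - assuming Gaussian errors, synthetic data with true parameters $\beta_0 = 2$, $\beta_1 = 1.5$, $\sigma = 0.5$, and `scipy` for the optimization - might look like this:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=200)

def neg_log_likelihood(params, x, y):
    # Model assumption: y_i ~ Normal(beta0 + beta1 * x_i, sigma^2)
    beta0, beta1, log_sigma = params
    sigma = np.exp(log_sigma)      # optimize log(sigma) to keep sigma > 0
    mu = beta0 + beta1 * x
    return -np.sum(norm.logpdf(y, loc=mu, scale=sigma))

# Maximizing the likelihood = minimizing the negative log-likelihood
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0, 0.0]),
                  args=(x, y))
beta0_mle, beta1_mle, log_sigma_mle = result.x
print(beta0_mle, beta1_mle, np.exp(log_sigma_mle))
print(-result.fun)   # the maximized log-likelihood
```

For a linear model with Gaussian errors, the MLE of $\beta_0$ and $\beta_1$ coincides with the least-squares solution - which is why the statsmodels checks throughout this section keep agreeing.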

Check against statsmodels

Log Likelihood

Check against statsmodels

The Likelihood Function

Intro Readings and Videos



Goran S. Milovanović & Aleksandar Cvetković

DataKolektiv, 2022/23.

hello@datakolektiv.com

License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.